Skip to content

Non-record: SP8192 + SOTA recipe on 1xA100 — 1.07035 BPB (TTT)#1528

Open
xiehuanyi wants to merge 2 commits intoopenai:mainfrom
xiehuanyi:submission/s2048-4h-a100-1.1104
Open

Non-record: SP8192 + SOTA recipe on 1xA100 — 1.07035 BPB (TTT)#1528
xiehuanyi wants to merge 2 commits intoopenai:mainfrom
xiehuanyi:submission/s2048-4h-a100-1.1104

Conversation

@xiehuanyi
Copy link
Copy Markdown

@xiehuanyi xiehuanyi commented Apr 10, 2026

Summary

UPDATED 2026-04-11: Replaces the earlier 1.1104 BPB result with a much stronger 1.07035 BPB (TTT) / 1.07266 (sliding) using the exact PR #1493 SOTA recipe (SP8192 + 3-layer recurrence + parallel residuals + QK-Gain 5.25 + MuonEq-R + SDClip GPTQ + Brotli + legal score-first TTT), adapted for 1×A100 instead of 8×H100.

Beats upstream main-leaderboard SOTA:

Still non-record because the run was on 1×A100 for 4h (≈80 H100-minute-equivalent of raw BF16 throughput, but not on the required hardware and without FA3).

What's in this PR

The training script is the decompressed PR #1493 train_gpt.py (their LZMA+base85 one-liner) with three minimal adaptations for Ampere + Python 3.9:

  1. FA3 → FA2/SDP fallback. A100 doesn't support FlashAttention-3. The attention wrapper now tries flash_attn (FA2) first, then falls through to PyTorch scaled_dot_product_attention with the flash backend. The SDP path adds a manual GQA head-repeat (PyTorch SDP doesn't natively support num_heads != num_kv_heads).
  2. Python 3.9 compat. Removed zip(strict=True) and nested double-quoted f-strings.
  3. GRAD_ACCUM_STEPS env override. Added so single-GPU runs can override the default 8 // world_size. Not actually used in this submission (defaults kept), but left in for flexibility.

Everything else is identical to PR #1493: SP8192 vocab, 11L×512d×8H/4KV, MLP 4x, depth recurrence looping layers 3-5 (17 virtual from 11 physical, activated at frac=0.35), parallel residuals layer 7+, QK-Gain 5.25, skip gates, MuonEq-R + AdamW, WD=0.095, EMA=0.9965, warmdown_frac=0.72, matrix_lr=0.022, GPTQ SDClip (k=12.85 mats / k=20.0 embs), int6 attn+mlp / int8 embs, Brotli-11 + byte shuffle, legal score-first TTT (SGD lr=0.005 mom=0.9, 3 epochs/32K chunk).

Numbers (seed 1337)

Metric Value
Pre-quant post-EMA 1.07610
Int6 quantized 1.08950
Int6 + sliding window (s=64) 1.07266
Int6 + sliding + legal TTT 1.07035
Steps trained 6371 / 20000 (wallclock capped at 4h)
Peak GPU memory 41.8 GiB / 80 GiB (A100)
Model params 35,944,536
Artifact bytes 15,970,123
Total submission 16,019,227 (under 16 MiB)

Hardware equivalence

  • Main leaderboard budget: 8 × H100 × 10 min = 80 H100-minute-equivalent
  • This submission: 1 × A100 × 240 min = ~76–80 H100-minute-equivalent
    (H100 BF16 ≈ 3.17× A100 BF16, plus FA3 is Hopper-only so there's an additional ~1.5× gap we don't close)

Comparison with exp60 / exp61 (same training config, different QK_gain)

Three runs of the same config differing only in QK_GAIN_INIT:

Run QK_GAIN Quant Sliding TTT
exp60 5.0 1.09031 1.07345 1.07137
exp61 5.0 1.09031 1.07345 1.07137
exp62 5.25 1.08950 1.07266 1.07035

The SOTA record's non-default QK_GAIN_INIT=5.25 consistently helps all three quant/eval phases, confirming the paper's "monotonic improvement from 4.0 to 5.25" observation.

Caveats

  • Single seed (1337). A 3-seed mean has not been run for time reasons.
  • exp60 and exp62 both crashed with SIGSEGV at the end of their own eval pipelines (torch.compile recompile issue when creating a fresh GPT instance for eval after training). The saved quantized artifacts were then evaluated successfully via a standalone eval_only.py script. exp61 completed its full eval pipeline natively.
  • grad_accum=2 variant (exp63/64) OOM'd at startup: SOTA model with MLP 4× + depth recurrence has a per-micro-batch footprint that doesn't fit on A100 at quarter the accum.

Test plan

  • Training completes within 4h wallclock cap
  • Quantization + sliding window eval produces valid int6 artifact under 16 MiB
  • Sliding window BPB beats upstream SOTA 1.0827 (got 1.07266)
  • TTT BPB beats upstream SOTA 1.0810 (got 1.07035)
  • Artifact round-trip through brotli + byte-shuffle without errors
  • 3-seed reproduction (not done)

Files:

  • README.md with full recipe + numbers + reproduction commands
  • submission.json with structured metadata
  • train_gpt.py — A100-adapted SOTA script
  • final_model.int6.ptz (15.97 MB)
  • train_seed1337.log + eval_seed1337.log
  • requirements.txt

Longer-context + longer-training variant of the ValCalib_GPTQ_XSA_BigramHash3072
stack. Moves TRAIN_SEQ_LEN 1024 -> 2048 and runs for 4h on 1x A100 (no H100
available), which together bring sliding-window int6 BPB from 1.1317 (s1024, 2h)
down to 1.11044406 (s2048, 4h).

Non-record because the submission was trained on 1x A100 for 240 minutes
(roughly equivalent to 76-80 H100-minutes, close to the 80 H100-minute official
budget) rather than on the required 8xH100 x 10min hardware.

Artifact: 15.94 MB int6+lzma, total submission 16.04 MB (under 16 MiB limit).
Model: 27M params, 11L 512d 3xMLP, XSA-all, BigramHash(2048), PartialRoPE(16/64),
LN Scale, SmearGate, Muon+AdamW WD=0.04, EMA(0.997 deferred), SWA, Late QAT@0.15,
Int6 GPTQ with self-generated AR calibration, LZMA preset=9, sliding window
eval stride=64.

Currently single-seed (1337). Seeds 42 and 999 are running and will be added
to submission.json once complete.
Copilot AI review requested due to automatic review settings April 10, 2026 18:42
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new non-record leaderboard submission under track_non_record_16mb for an 11-layer full-stack model trained at seq_len=2048 for 4h on 1×A100, reporting val_bpb=1.11044406 with int6 GPTQ + LZMA and sliding-window eval (stride=64).

Changes:

  • Adds the full submission bundle (training script, run log, metadata JSON, README, requirements) for 2026-04-10_s2048_4h_1xA100_1.1104.
  • Updates the training script for A100 environments (FA2/SDP attention fallback, deferred EMA start, Python 3.9 compatibility).
  • Records reported metrics, artifact sizes, and reproduction instructions.

Reviewed changes

Copilot reviewed 3 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/train_gpt.py Training/eval/quantization script used to produce the submission artifact and metrics
records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/train_seed1337.log Captured run log with reported metrics and byte sizes
records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/submission.json Structured metadata for the submission (metrics, sizes, config)
records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/README.md Human-readable summary, numbers, and reproduction command
records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/requirements.txt Minimal dependency list for reproducing the run

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1158 to +1162
seq_len = eval_seq_len or args.train_seq_len
total_tokens = val_tokens.numel() - 1
window_starts = [ws for ws in range(0, total_tokens, stride)
if min(ws + seq_len, total_tokens) - ws >= 1]
total_windows = len(window_starts)
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eval_val_sliding currently includes window starts all the way to total_tokens, which creates short tail windows (wlen < seq_len). For ws>0 these tail windows score tokens that were already scored by the last full window, slightly over-weighting the end of the validation set and contradicting the “every token scored exactly once” sliding-window definition used elsewhere (e.g. the TTT window_starts filter in this file). Consider restricting window_starts to full windows (ws <= total_tokens - seq_len) and/or filtering with wlen >= stride or ws == 0 to avoid double-counting.

Copilot uses AI. Check for mistakes.
Comment on lines +2056 to +2058
log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}")
log0(f"attn_backend:{_ATTN_BACKEND} sdp_backends:cudnn=False flash=True mem_efficient=False math=False")
log0(f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads}")
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logged SDP backend flags are hard-coded (mem_efficient=False), but earlier you call enable_mem_efficient_sdp(True). This makes the run metadata in train_seed1337.log misleading; please either query the actual backend settings or update the log string to match what is enabled.

Copilot uses AI. Check for mistakes.
Comment on lines +15 to +20
try:
import zstandard
_COMPRESSOR = "zstd"
except ImportError:
_COMPRESSOR = "zlib"
import numpy as np
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The optional zstandard import and _COMPRESSOR selection appear unused (no references elsewhere in this script), while the submission path always uses LZMA. This adds dead code and an unnecessary dependency surface; consider removing _COMPRESSOR/zstandard/zlib if they’re not meant to be toggled, or wire them into an actual compressor option.

Copilot uses AI. Check for mistakes.
Comment on lines +18 to +26
"val_bpb_int6_roundtrip": 1.13437381,
"artifact_bytes": 15941100,
"total_submission_bytes": 16040603,
"steps": 14065,
"step_avg_ms": 1023.86
}
},
"artifact_bytes_max": 15941100,
"bytes_total": 16040603,
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

artifact_bytes / artifact_bytes_max don’t match the sizes in the included train_seed1337.log. The log reports Serialized model int6+lzma: 15920436 bytes and Total submission size ...: 16040603 bytes, implying artifact_bytes should be 15920436 (and code bytes ~120167), not 15941100. Please recompute these fields from the actual final_model.int6.ptz and script size so metadata stays self-consistent.

Copilot uses AI. Check for mistakes.
Comment on lines +31 to +35
| Peak memory | 16.3 GiB |
| Model params | 26,993,756 |
| Artifact bytes (int6+lzma) | 15,941,100 |
| **Total (code + artifact)** | **16,040,603** (under 16 MiB = 16,777,216) |

Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README’s artifact/total byte counts appear inconsistent with the included training log. train_seed1337.log reports Serialized model int6+lzma: 15920436 bytes and Code size: 120167 bytes (total 16040603), but this README lists Artifact bytes ... 15,941,100. Please update the README numbers to match the actual generated files (or regenerate the log/README from the same run) so readers can verify the 16 MiB constraint.

Copilot uses AI. Check for mistakes.
Replaces the earlier 1.1104 non-record submission with a much stronger
result that reproduces the PR openai#1493 SOTA 1.0810 recipe on 1xA100 for 4h
instead of the required 8xH100 for 10min.

Key numbers (seed 1337):
- Int6 Sliding Window: 1.07266 BPB (beats upstream SOTA 1.0827 by -0.0100)
- Int6 + Legal TTT:    1.07035 BPB (beats upstream SOTA 1.0810 by -0.0107)
- Pre-quant post-EMA:  1.07610 BPB
- Steps trained: 6371 (wallclock capped at 4h)
- Total submission: 16,019,227 bytes (under 16 MiB)

This is the exact PR openai#1493 SOTA recipe (SP8192 + 3-layer recurrence +
parallel residuals layer 7+ + QK-Gain 5.25 + MuonEq-R + SDClip GPTQ +
Brotli + byte shuffle + legal score-first TTT) with three A100 adaptations:

1. FA3 -> PyTorch SDP fallback with manual GQA head-repeat (A100 doesn't
   support FA3)
2. Python 3.9 compatibility (removed zip(strict=True) and nested
   double-quoted f-strings)
3. GRAD_ACCUM_STEPS env override for single-GPU runs

Three seeds of the same config ran (exp60, exp61, exp62). exp60/62
crashed in their own eval phase with a torch.compile recompile issue
when creating a fresh GPT instance after training; the saved quantized
artifacts were then evaluated successfully via a standalone eval_only.py
script. exp62 (QK_GAIN_INIT=5.25, the exact SOTA record value) beat
exp60/exp61 (QK_GAIN_INIT=5.0, the script default) consistently across
quant/sliding/TTT metrics, matching the "monotonic improvement from 4.0
to 5.25" observation in the SOTA paper.

Still single-seed; 3-seed mean is not yet run due to time constraints.
@xiehuanyi xiehuanyi changed the title Non-record: 11L s2048 4h on 1xA100 — 1.1104 BPB Non-record: SP8192 + SOTA recipe on 1xA100 — 1.07035 BPB (TTT) Apr 11, 2026
@MatoTeziTanka
Copy link
Copy Markdown

Community Review — Non-record: SP8192 + SOTA recipe on 1xA100 — 1.07035 BPB (TTT)

BPB: 1.07035 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1413 dexhunter pattern)

What I found in the code (head SHA c01e4dac462a, file records/track_non_record_16mb/2026-04-11_SP8192_SOTA_QK525_TTT_1.0704_1xA100/train_gpt.py):

The TTT path at line 356 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape of the current leaderboard's legal frontier (PR #1413 dexhunter, the 1.0828 SP8192 + QK-Gain 5 + Legal TTT entry — verified at its head SHA against the is_last_chunk + torch.no_grad() score-first accumulator pattern).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 12.31s, dim=512, layers=11, vocab=8192, code=49104 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTankaThe Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 12.31s, dim=512, layers=11, vocab=8192, code=49104 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants